We performed a large-scale crawl of the World Wide Web, covering 6.9 million domains, including all high-traffic sites of the Internet. We present a study of the correlations found between quantities measuring the structural relevance of each node in the network (the in- and out-degree, the local clustering coefficient, the first-neighbor in-degree and the Alexa rank). We find that some of these properties show strong correlation effects and that the dependencies arising from these correlations follow power laws not only for the averages, but also for the boundaries of the respective density distributions. In addition, these scale-free limits do not follow the same exponents as the corresponding averages. In our study we retain the directionality of the hyperlinks and develop a statistical estimate for the clustering coefficient of directed graphs.

We include in our study the correlations between the in-degree and the Alexa traffic rank, a popular index for the traffic volume, finding non-trivial power-law correlations. We find that sites with more or fewer than about $10^3$ links from different domains have remarkably different statistical properties, for all correlation functions studied, pointing towards an underlying hierarchical structure of the World Wide Web.

PACS numbers: 89.20.Hh 89.75.-k 



I. INTRODUCTION 

The emergence of the World Wide Web (WWW) is arguably among the most relevant events of the present time. Interest in this system, and in networks in general, has permeated all of society, including physics. This led, at the turn of the century, to a large number of studies in what came to be known as "network science". Most studies of the WWW were performed, however, in the early 2000s [1-3], and large-scale studies of the WWW are rather hard to find nowadays, despite the immense growth of the Internet in the last 10 years.

A remarkable finding of the first-generation studies of the WWW is the emergence of scale-free degree distributions, which can potentially be explained by preferential attachment, although the exponents obtained are not universal [3]. Generally, one can assume that the growth process of a complex network will be influenced by inter-node correlations and that these dependencies will be reflected in the resulting network topology. However, such correlations are not easy to detect and characterize, and have not been studied in depth. A simple rule such as preferential attachment is not expected to reproduce completely the structures found in real-world networks, and more complicated models have therefore been developed to replicate their behavior [4, 5].

Correlations between different properties are generally used as a proxy to study the internal structure of the network. For instance, Vespignani studied correlations between the in-degree of a node and that of a first neighbor of said node [6], showing a scale-free property (recently modeled by Takagi [9]), and Barabasi and Albert studied the local clustering coefficient as a function of the in-degree [10], in order to obtain information regarding the hierarchical structuring of the network. However, real-world data about said correlations is not abundant.

FIG. 1: Left: A node with in-degree $k_{\rm in} = 4$ and out-degree $k_{\rm out} = 3$. Right: Two types of in-degree clusters, with the edges always directed towards the central site (A).



In the present work we study the complete dominant core of the WWW by crawling 6.9 million domains, including all domains with the largest traffic (all domains with an Alexa rank of one million or less are included). Collapsing the data by neglecting link multiplicities, we study the network of inter-domain hyperlinks (not webpages), containing about half a billion directed edges. We find non-trivial correlations between in- and out-degree, between the in-degree and the local clustering coefficient, and between the degrees of neighboring sites. In addition to evaluating averaged quantities, we study the full density plots, finding novel scaling features for the boundaries of several correlation functions. We present, in addition, a formula for the clustering coefficient of random directed graphs characterized by given arbitrary in- and out-degree sequences. Finally, we present an analysis of the correlations between the number of in-links and the Alexa rank of a domain.
II. THEORY

For directed graphs we have to distinguish between the distributions $p_{\rm in}(k)$ and $p_{\rm out}(l)$ of the in- and the out-degrees $k$ and $l$, respectively. There are, in addition, two kinds of nearest neighbors, in-neighbors and out-neighbors. Site B1 in Fig. 1 is a nearest in-neighbor of site A and a nearest out-neighbor of site B2. Alternatively, one could call B1 an ancestor of A and a descendant of B2 [11]. For bi-directional links, as between B3 and B4 in Fig. 1, in-neighbors are also out-neighbors. The total number of in-links equals the total number of out-links; the in- and out-degree coordination numbers

$$ z = \sum_k k\, p_{\rm in}(k) = \sum_l l\, p_{\rm out}(l) \qquad (1) $$

are hence identical.
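
The equality of the two coordination numbers can be checked directly on an edge list. The following minimal Python sketch is only an illustration: the function name `coordination_numbers` and the toy edge list are assumptions made here, not part of the crawl pipeline.

```python
from collections import Counter

def coordination_numbers(edges, n_nodes):
    """Mean in-degree and mean out-degree of a directed graph.

    edges: iterable of (source, target) pairs; n_nodes: total number of sites.
    """
    in_deg, out_deg = Counter(), Counter()
    for src, dst in edges:
        out_deg[src] += 1   # one more out-link for the source
        in_deg[dst] += 1    # one more in-link for the target
    z_in = sum(in_deg.values()) / n_nodes
    z_out = sum(out_deg.values()) / n_nodes
    return z_in, z_out

# Hypothetical toy graph: B2 -> B1 -> A plus a bi-directional link B3 <-> B4.
toy_edges = [("B2", "B1"), ("B1", "A"), ("B3", "B4"), ("B4", "B3")]
z_in, z_out = coordination_numbers(toy_edges, n_nodes=5)
print(z_in, z_out)
```

Every directed edge contributes exactly one out-stub and one in-stub, which is why the two averages coincide for any edge list.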



A. Clustering coefficient model for directed graphs 

In order to calculate the relevance of correlations be- 
tween in- and out- degree in the structure of the network, 
we have developed a statistical model of the clustering 
coefficient for given distributions of in- and out- degree 
which are uncorrelated. 

We define with

$$ q_{\rm out}(l) = \frac{(l+1)\, p_{\rm out}(l+1)}{N_q} \qquad (2) $$

the excess distribution [12] of outgoing links of a nearest in-neighbor. The normalization constant $N_q = \sum_l l\, p_{\rm out}(l)$ is just the coordination number $z$, see (1). Equivalently, we define via

$$ q_{\rm in}(k) = \frac{1}{N_q} \sum_l p_{\rm in,out}(k,l)\, l \qquad (3) $$

the degree distribution (not excess) of incoming links of a nearest in-neighboring site. Here $p_{\rm in,out}(k,l)$ is the probability that a site has $l$ out-links and $k$ in-links (joint distribution function), with the usual relations

$$ \sum_k p_{\rm in,out}(k,l) = p_{\rm out}(l) , \qquad (4) $$

$$ \sum_l p_{\rm in,out}(k,l) = p_{\rm in}(k) \qquad (5) $$

for the marginal distribution functions. The normalization constant $N_q$ in (3) is given by the coordination number $z$,

$$ N_q = \sum_k \sum_l p_{\rm in,out}(k,l)\, l = \sum_l l\, p_{\rm out}(l) = z . $$

For the clustering coefficient $\hat{C}$ (the 'hat' symbol stands here for the clustering coefficient of a directed graph) we now consider two in-neighbors having, with probabilities $q_{\rm in}(k)$ and $q_{\rm out}(l)$ respectively, $k$ in-links and $l$ excess out-links (as stubs).

We now assume that the distributions $q_{\rm in}(k)$ and $q_{\rm out}(l)$ of the two neighbors are independent of each other. The probability, for a graph with $N$ nodes, that a given pair of in- and out-stubs is connected is then $1/(Nz)$, where $Nz$ is the total number of in- or out-stubs, and hence

$$ \hat{C} = \frac{1}{Nz} \sum_{k,l} q_{\rm in}(k)\, k\, l\, q_{\rm out}(l) = \frac{1}{N z^3} \Big( \sum_{k,l} k\, p_{\rm in,out}(k,l)\, l \Big) \Big( \sum_l p_{\rm out}(l+1)\, l\, (l+1) \Big) . $$

Transforming now into a sum over sites, every site $s$ being characterized by an in-degree $k_s$ and out-degree $l_s$, one obtains

$$ \hat{C} = \frac{1}{N^3 z^3} \Big( \sum_s k_s\, l_s \Big) \Big( \sum_s (l_s - 1)\, l_s \Big) , \qquad (6) $$

which coincides with the usual expression [11] for non-directed graphs (apart from a factor $l_s$ instead of $l_s - 1$ in the first factor) when taking $k_s = l_s$ for $s = 1, \dots, N$. A fully connected network results in $\hat{C} \approx 1$ for large $N$ under this formula.

We note that the expression (6) for $\hat{C}$ may actually violate the bound $\hat{C} \le 1$ when applied to a real-world graph, due to the neglect of inter-site degree correlations. As an example, consider a network composed of a single star, like the site C in Fig. 1 but with bi-directional edges. For an undirected (and loopless) star the degree sequence is

$$ k_1 = l_1 = N - 1 , \qquad k_i = l_i = 1 , \quad i = 2, \dots, N , $$

with an intensive coordination number $z = 2(N-1)/N \approx 2$. The statistical formula (6) for the clustering coefficient would, on the other hand, diverge,

$$ \hat{C} \approx \frac{1}{N^3 2^3} \big( (N-1)^2 + (N-1) \big) \big( (N-1)^2 - (N-1) \big) \propto N , $$

in the thermodynamic limit $N \to \infty$. A substantial deviation of $\hat{C}$ from the true clustering coefficient is hence a measure for the strength of inter-site degree correlations, the expression (6) being valid for graphs with vanishing inter-node correlations.
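
As a numerical illustration of the statistical estimate, the sketch below evaluates it for a bi-directional star and for a fully connected graph. It assumes that Eq. (6) has the form $\hat{C} = (\sum_s k_s l_s)(\sum_s (l_s-1) l_s)/(Nz)^3$ with $Nz = \sum_s l_s$; the function names `c_hat`, `star` and `complete` are hypothetical helpers introduced only for this example.

```python
def c_hat(k, l):
    """Statistical clustering estimate for in-degrees k and out-degrees l.

    Assumed form of Eq. (6):
        (sum_s k_s l_s) * (sum_s (l_s - 1) l_s) / (N z)^3,  N z = sum_s l_s.
    """
    if sum(k) != sum(l):
        raise ValueError("total in-degree must equal total out-degree")
    stubs = sum(l)                                   # N z, number of out-stubs
    first = sum(ks * ls for ks, ls in zip(k, l))     # sum_s k_s l_s
    second = sum((ls - 1) * ls for ls in l)          # sum_s (l_s - 1) l_s
    return first * second / stubs ** 3

def star(n):
    """Degree sequences of a bi-directional star with n sites."""
    deg = [n - 1] + [1] * (n - 1)
    return deg, deg

def complete(n):
    """Degree sequences of a fully connected bi-directional graph."""
    deg = [n - 1] * n
    return deg, deg

for n in (10, 100, 1000):
    print(n, c_hat(*star(n)), c_hat(*complete(n)))
```

For the star the estimate grows roughly like $N/8$, while for the fully connected graph it equals $(N-2)/N$ and stays below unity.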



III. RESULTS 

Using the crawlers of the former file search engine FindFiles.net [13] we crawled, mostly in 2011, 6.9 million domains (of type http://www.domain.com) with a total of 64 million subdomains (of type http://subdomain.domain.com). These 6.9 million domains have 223 million hyperlinks between them, linking in addition to 50 million other sites. For the network analysis we neglected these 50 million external sites, as we did not crawl them separately. The network of 223 million inter-domain directed links has an average degree of 32, and 0.7 million of the 6.9 million domains are isolated in the sense that they have no in-links; they cannot be reached from the core of the World Wide Web. A further one million sites have just a single hyperlink directed to them.

FIG. 2: Complementary cumulative distributions $P(k) = \int_k^\infty p(k')\,dk'$ for the in-degree (main panel) and the out-degree (inset), log-log plot. The dashed lines, corresponding to power-law distributions, have slopes $\gamma = -1.3$ and $\gamma = -1.4$, respectively for the in- and the out-degree.

The crawling strategy started from the set of about 32 million subdomains referred to in Wikipedia and DMOZ (all languages), with further systematic extensions. We included, in particular, the one million domains with the largest traffic volume in terms of the Alexa rank. This data set, which we denote FF-2011, hence corresponds essentially to the complete relevant part of the World Wide Web in terms of traffic volume.
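
The reduction from page-level hyperlinks to a domain-level network (neglecting link multiplicities and links to sites outside the crawled set) could be sketched as follows. This is a minimal illustration, not the FindFiles.net implementation: the helper names are hypothetical, and the domain extraction simply keeps the last two labels of the host name, which is an assumption that fails for suffixes such as co.uk.

```python
from urllib.parse import urlsplit

def registered_domain(url):
    """Naive domain extraction: keep the last two labels of the host name.

    ASSUMPTION: a real pipeline would need a public-suffix list.
    """
    host = urlsplit(url).hostname or ""
    labels = host.lower().split(".")
    return ".".join(labels[-2:])

def collapse_links(page_links, crawled_domains):
    """Collapse (source_url, target_url) pairs into unique inter-domain edges."""
    edges = set()   # a set discards link multiplicities
    for src_url, dst_url in page_links:
        src, dst = registered_domain(src_url), registered_domain(dst_url)
        # keep only links between two different crawled domains
        if src != dst and src in crawled_domains and dst in crawled_domains:
            edges.add((src, dst))
    return edges

# Hypothetical example URLs.
links = [
    ("http://www.a.com/x", "http://sub.b.com/y"),
    ("http://a.com/z", "http://www.b.com/"),      # duplicate edge a.com -> b.com
    ("http://a.com/", "http://www.a.com/about"),  # intra-domain link, dropped
    ("http://b.com/", "http://c.org/"),           # external target, dropped
]
domain_edges = collapse_links(links, crawled_domains={"a.com", "b.com"})
print(domain_edges)
```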



A. In- and out-degree distributions 

The degree distributions of hyperlinks have been observed to follow a power law $\sim k^{\gamma}$, with an exponent close to the limiting case $\gamma \to -2$ (when the mean degree would diverge in the thermodynamic limit) [2, 3, 14, 15]. In Fig. 2 we present the complementary cumulative distribution functions [16] for both the in-degree and the out-degree.

Over a range of about 2.5-3 orders of magnitude, the data can be approximated quite well by power-law distributions, with exponents $\gamma_{\rm in} = -2.3$ and $\gamma_{\rm out} = -2.4$ respectively for the in- and the out-degree. These results confirm earlier studies [131 El HE], which consistently find $|\gamma_{\rm in}| < |\gamma_{\rm out}|$. The absolute magnitudes of the scaling exponents reported vary slightly from study to study, either because of the evolution of the Internet over time or because of the size of the respective databases.

FIG. 3: Density distribution $p_{\rm in,out}(k, k_{\rm out})$ of domains with in-degree $k$ and out-degree $k_{\rm out}$. The density is shown in log scale, as are both axes. The solid line represents the average $\langle k_{\rm out}\rangle(k)$. The probability density is normalized, $\int\!\!\int p_{\rm in,out}(k, k_{\rm out})\, dk\, dk_{\rm out} = 1$.
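
The complementary cumulative distribution can be computed directly from a degree sequence. Since a density $p(k) \sim k^{\gamma}$ integrates to $P(k) \sim k^{\gamma+1}$, cumulative slopes of $-1.3$ and $-1.4$ correspond to $\gamma = -2.3$ and $\gamma = -2.4$. The sketch below is a minimal illustration for discrete degrees, using the convention $P(k)$ = fraction of nodes with degree at least $k$; the function name `ccdf` is an assumption of this example.

```python
from collections import Counter

def ccdf(degrees):
    """Return a sorted list of (k, P(k)), with P(k) the fraction of
    nodes whose degree is greater than or equal to k."""
    counts = Counter(degrees)
    n = len(degrees)
    remaining = n            # number of nodes with degree >= current k
    result = []
    for k in sorted(counts):
        result.append((k, remaining / n))
        remaining -= counts[k]
    return result

print(ccdf([1, 1, 2, 4]))
```

Plotting the returned pairs on doubly logarithmic axes gives a curve of the kind shown in Fig. 2; the cumulative form avoids the binning noise of a direct histogram in the sparse large-degree tail.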



B. Correlations between in- and out-degree 

In Fig. 3 the density distribution of nodes having an in-degree $k$ and an out-degree $k_{\rm out}$ is presented, together with the average out-degree $\langle k_{\rm out}\rangle(k)$ for sites having an in-degree $k$. In- and out-degree do not seem to be particularly correlated at first sight. However, the average out-degree shows two regimes with approximate power-law scaling, for $k < 10^3$ and $k > 10^3$, with exponents $\gamma = -0.6$ and $\gamma = -1.2$ respectively. If the joint distribution $p_{\rm in,out}(k, k_{\rm out})$ were to factorize, $p_{\rm in,out}(k, k_{\rm out}) \to p_{\rm in}(k)\, p_{\rm out}(k_{\rm out})$, the mean out-degree

$$ \langle k_{\rm out}\rangle(k) = \int p(k,l)\, l\, dl \to p_{\rm in}(k)\, z \qquad (7) $$

would functionally follow the in-degree distribution $p_{\rm in}(k)$, where $z \approx 32$ is the average (in- and out-) degree of our Internet data. However, as shown in Fig. 2, the marginal in-degree distribution $p_{\rm in}(k)$ falls approximately like $k^{-2.3}$, viz substantially faster than (7) would imply. In- and out-degree are hence non-trivially correlated. We will discuss the nature of the respective correlations in more detail further below when discussing the distribution of local clustering coefficients.
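
An average such as $\langle k_{\rm out}\rangle(k)$ can be estimated from a list of (in-degree, out-degree) pairs by averaging within logarithmic bins of the in-degree. The sketch below is a minimal illustration; the binning scheme (`bins_per_decade`) and the function name are assumptions of this example and are not taken from the analysis above.

```python
import math
from collections import defaultdict

def binned_conditional_mean(pairs, bins_per_decade=4):
    """Average y within logarithmic bins of x (only x >= 1 is used).

    Returns a sorted list of (geometric bin center, mean of y in the bin).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in pairs:
        if x < 1:
            continue                      # degree 0 has no logarithmic bin
        b = math.floor(math.log10(x) * bins_per_decade)
        sums[b] += y
        counts[b] += 1
    return [(10 ** ((b + 0.5) / bins_per_decade), sums[b] / counts[b])
            for b in sorted(sums)]

# Hypothetical (k_in, k_out) pairs.
means = binned_conditional_mean([(1, 2), (1, 4), (100, 10)], bins_per_decade=1)
print(means)
```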
C. Mean clustering coefficient

The local clustering coefficient $C_i$ is given by the number of linked nearest neighbors of site $i$, relative to the total number of possible links between the neighbors. For directed graphs there are in- and out-neighbors and various possible 3-site loops, as illustrated in Fig. 1, also known as network motifs \T§\ Ul) . Here we examine the in-clustering coefficient. For a given site $i$ the in-clustering coefficient $C_i$ is given by the average number of links between the in-neighbors of site $i$. In Fig. 1 the sites (B1,A,B2) form an in-loop of site A, contributing to $C_A$, while the sites (B4,A,B3) contribute two in-loops. We focus on the in-clustering coefficient since the number of in-links is a measure for the importance of a site, contributing to its traffic volume.
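
The in-clustering coefficient can be computed from an edge list by counting, for every site, the directed links among its in-neighbors and dividing by the $k(k-1)$ possible ones. The following sketch is a straightforward (quadratic in the in-degree) illustration; the function name and the toy edge list, which is only loosely modeled on the description of Fig. 1, are assumptions of this example.

```python
from collections import defaultdict

def in_clustering(edges):
    """Local in-clustering coefficient of every site with in-links.

    For a site with k in-neighbors, counts the directed links among those
    neighbors and divides by the k (k - 1) possible ones.
    """
    edge_set = {(u, v) for u, v in edges if u != v}   # drop self-links
    in_nbrs = defaultdict(set)
    for u, v in edge_set:
        in_nbrs[v].add(u)
    coeff = {}
    for node, nbrs in in_nbrs.items():
        k = len(nbrs)
        if k < 2:
            coeff[node] = 0.0      # no pair of in-neighbors to connect
            continue
        links = sum(1 for u in nbrs for v in nbrs
                    if u != v and (u, v) in edge_set)
        coeff[node] = links / (k * (k - 1))
    return coeff

# Hypothetical toy graph: B2 -> B1 and B3 <-> B4, all four pointing to A.
toy = [("B1", "A"), ("B2", "A"), ("B3", "A"), ("B4", "A"),
       ("B2", "B1"), ("B3", "B4"), ("B4", "B3")]
print(in_clustering(toy)["A"])
```

In this toy graph the bi-directional pair contributes two links and the single link B2 to B1 contributes one, so site A has 3 of 12 possible links among its in-neighbors.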

We find, for the FF-2011 network data, a mean clustering coefficient $C = \sum_i C_i / N$ of $C = 0.18$. This is, for two reasons, a surprisingly high value. Firstly, the connection probability $p$ is very low, being just $p = 4.6 \times 10^{-6}$. Secondly, a quite large number of sites, 0.27%, has a vanishing local clustering $C_i = 0$, and only a small fraction, 0.3%, of domains, mostly with small degrees, have a maximal local clustering coefficient of unity.

We can assess the impact of correlations on the for- 
mation of local loops by considering identical degree se- 
quences for the in- and out- degree, as extracted from 
the FF-2011 network data, but considering various types 
of correlations between the in- and out- degree of each 
node. 

• Applying Eq. (6) to the actual network, the value obtained amounts to $\hat{C}_{\rm model} = 1.5$. This value exceeds the maximum $C = 1$, pointing towards very strong correlations between the in- and the out-degree distributions; compare the discussion below Eq. (6).

• For a network having the same degree distributions $p_{\rm in}(k)$ and $p_{\rm out}(l)$ for the in- and the out-degree as the actual network, but without correlations between these degrees, viz assuming a joint probability distribution $p_{\rm in,out}(k,l) = p_{\rm in}(k)\, p_{\rm out}(l)$, the clustering coefficient obtained by Eq. (6) would amount to $\hat{C}_{\rm decorr} = 2.4 \times 10^{-3}$.

• A network where the in- and out-degrees are anticorrelated (nodes with the largest in-degree are mapped to the smallest out-degree) would amount to an even lower $\hat{C}_{\rm anticorr} = 3.2 \times 10^{-4}$.

• A network with a maximally correlated distribution of in- and out-degree (nodes with the largest in-degree being mapped to the largest out-degree) would result again in a higher-than-unity clustering coefficient $\hat{C}_{\rm maxcorr} = 3.2$, when using Eq. (6).

We hence conclude that the in- and out-degree are 
quite strongly correlated positively for the World Wide 
Web. 
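
The four scenarios listed above differ only in how the in- and out-degrees are paired at each site. Assuming Eq. (6) has the form $\hat{C} = (\sum_s k_s l_s)(\sum_s (l_s-1) l_s)/(Nz)^3$, only the first factor depends on the pairing, and the rearrangement inequality orders the results as anticorrelated, decorrelated, maximally correlated. The sketch below illustrates this on a small hypothetical degree sequence; the function names are assumptions of this example, and the decorrelated case uses the exact factorized expectation $\sum_s k_s \sum_s l_s / N$ rather than a random shuffle.

```python
def c_hat_from_first_factor(first, l):
    """Assumed Eq. (6) with the pairing-dependent factor sum_s k_s l_s given."""
    stubs = sum(l)                             # N z
    second = sum((ls - 1) * ls for ls in l)    # independent of the pairing
    return first * second / stubs ** 3

def pairing_scenarios(k, l):
    """Statistical clustering estimate for four pairings of the same degrees."""
    n = len(k)
    k_sorted, l_sorted = sorted(k), sorted(l)
    firsts = {
        "actual": sum(a * b for a, b in zip(k, l)),
        "decorr": sum(k) * sum(l) / n,         # factorized joint distribution
        "anticorr": sum(a * b for a, b in zip(k_sorted, reversed(l_sorted))),
        "maxcorr": sum(a * b for a, b in zip(k_sorted, l_sorted)),
    }
    return {name: c_hat_from_first_factor(f, l) for name, f in firsts.items()}

# Hypothetical degree sequences with equal totals.
result = pairing_scenarios(k=[5, 1, 1, 1], l=[1, 1, 1, 5])
print(result)
```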



FIG. 4: Probability density $P(k, C)$ of pairs of in-degree $k$ and local clustering coefficient $C$. The density and both axes are given in log scale. The solid line represents the average value $\langle C\rangle(k)$ as a function of the in-degree $k$. The probability density is normalized, $\int\!\!\int P(k,C)\, dk\, dC = 1$.



D. Distribution of local clustering coefficients 

In Fig. 4 the density of $(k, C)$ pairs is shown in a log-log scale, where $k$ is the in-degree and $C$ the local clustering coefficient. The density distribution has upper and lower cutoffs scaling approximately like $\sim k^{\gamma}$, with $\gamma_{\rm max} \approx -1.3$ and $\gamma_{\rm min} = -2$. The lower limit has a simple explanation. The lowest non-zero local clustering coefficient is realized when just a single loop exists out of the $k(k-1)$ possible triangles,

$$ C_{\rm min} = \frac{n_l}{k(k-1)} \propto k^{-2} , \qquad (8) $$

when setting the number of loops $n_l$ to one. The exponent of the upper limit, $\gamma_{\rm max} = -1.3$, implies, compare (8), that the number of local loops scales like $\sim k^{0.7}$. We have presently no explanation for this scaling behavior. The average value $\langle C\rangle(k)$, as a function of in-degree, mostly follows a power law for small $k < 10^3$, with an exponent $\gamma = -0.26$. For larger $k > 10^4$ the exponent changes toward $\gamma = -1$ for the mean local clustering coefficient. This last exponent is in agreement with previous observations in [10], and is a fingerprint of a hierarchical network structure. The change in behavior at the point $k = 10^3$ is also observable in the correlation between the in-degree and the degree of nearest neighbors, as we will show in the next sections.
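
Scaling exponents in two regimes can be estimated, for instance, by a least-squares fit of $\log C$ against $\log k$ restricted to each regime. The sketch below illustrates this on synthetic data constructed to have exponents $-0.26$ below $k = 10^3$ and $-1$ above; the fitting procedure, the function name and the synthetic curve are assumptions of this example and are not claimed to be the procedure used for the figures.

```python
import math

def power_law_exponent(points, x_min=0.0, x_max=math.inf):
    """Least-squares slope of log10(y) versus log10(x) for x_min <= x < x_max."""
    logs = [(math.log10(x), math.log10(y))
            for x, y in points
            if x_min <= x < x_max and x > 0 and y > 0]
    n = len(logs)
    if n < 2:
        raise ValueError("need at least two points in the fitting range")
    mean_x = sum(a for a, _ in logs) / n
    mean_y = sum(b for _, b in logs) / n
    sxx = sum((a - mean_x) ** 2 for a, _ in logs)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in logs)
    return sxy / sxx

def synthetic(x):
    """Synthetic curve, continuous at x = 1e3, with exponents -0.26 and -1."""
    return x ** -0.26 if x < 1e3 else 1e3 ** 0.74 * x ** -1.0

points = [(10 ** (i / 4), synthetic(10 ** (i / 4))) for i in range(1, 24)]
low = power_law_exponent(points, x_max=1e3)
high = power_law_exponent(points, x_min=1e3)
print(low, high)
```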

There is a group of nodes with very high clustering coefficients $C \approx 1$ around the $k \sim 10^3$ region (close to where the upper limit with the $\gamma_{\rm max} = -1$ slope intersects the abscissa), which falls somewhat off the line. After analyzing some of the domains involved in this region, we conclude that this group of nodes does not represent the intrinsic network structure of the WWW, belonging most probably to link farms. These nodes are, however, responsible for the jumps in $\langle C\rangle(k)$ at $k \sim 10^3$.

FIG. 5: The probability density distribution $P(k, k_{\rm nn})$ of the in-degree $k$ and the in-degree $k_{\rm nn}$ of nearest-neighbor sites. Both the density distribution and the axes are given in log scale. The solid line corresponds to the average value $\langle k_{\rm nn}\rangle(k)$. The probability distribution is normalized such that $\int\!\!\int P(k, k_{\rm nn})\, dk\, dk_{\rm nn} = 1$.



E. Nearest-neighbor degree correlations

In Fig. 5 the density of pairs $(k, k_{\rm nn})$ is shown, where $k_{\rm nn}$ is the in-degree of a first neighbor and $k$ the in-degree. The dots with higher densities for low $k$ values are relatively large groups of linked domains which share exactly the same pairs of in-degree and first-neighbor in-degree. The domains do not seem to be particularly related, although we do not discard the possibility that they may belong to link farms, as they clearly stand out from the general behavior of the density distribution.

When analyzing the average $\langle k_{\rm nn}\rangle(k)$ as a function of the in-degree $k$, we observe a very weak increase for small $k$ until $k \approx 10^3$. We can fit this increase fairly well with a power law of exponent $\gamma \approx 0.1$. This behavior would be in agreement with the one observed in the canonical Barabasi-Albert model [2, 14], though it differs from a 1998 WWW network study [14].

In the range from $k \approx 10^3$ to $k \approx 10^6$ we observe a change in the behavior of the average $\langle k_{\rm nn}\rangle(k)$, as it starts decaying with increasing $k$. This decay follows a power law as well, with an exponent of about $\gamma \approx -0.3$. This decay is closer to the results found in [14] for a subset of the 1998 Internet data and the fitness model developed therein (which decays with $\gamma = -0.5$). However, the decay is observed in our results for much higher degree $k$ than in [14], which has data limited to $k < 10^3$. We speculate that this difference is due to the size of the network studied, although it might be possible that the evolution of the WWW in the last 10 years is responsible for the structural change.

FIG. 6: Density of pairs $(A, k)$, where $A$ is the Alexa index and $k$ the corresponding in-degree of the domain. The density is given in log scale, as are both axes. The solid line shows the average $\langle k\rangle(A)$ as a function of $A$. The probability is normalized such that $\int\!\!\int P(k, A)\, dk\, dA = 1$.



F. Correlations between in-degree and Alexa index 

We have analyzed the correlations of the Alexa rank [21] with respect to the in-degree $k$. The Alexa rank is arguably one of the most popular measures of the traffic received by an Internet site, and hence of its relevance. The ranking is proprietary, so the general public does not have access to the specifics of its calculation, although according to the official information it is derived from the traffic observed, with data partly retrieved from users who installed the Alexa add-on in their web browser [22]. In this ranking, the site with the most traffic has rank $A = 1$, the site with the next-largest traffic has rank $A = 2$, and so on. The rank does not provide any information about the precise amount of traffic: a larger $A$ index does not indicate how much less traffic a site receives, but only that it receives less traffic than the sites with smaller $A$.

In Fig. 6 we present the density of domains as a function of their in-degree $k$ and Alexa rank $A$. We only analyze the Alexa rank for sites having an in-degree $k > 20$, with a few exceptions, due to constraints in retrieving the Alexa rank data. We observe a distribution limited from above and below by two power laws with exponents $\gamma_{\rm upper} = -0.4$ and $\gamma_{\rm lower} \approx -1.7$. The lower limit is, however, less pronounced, due to the lack of samples.

FIG. 7: Relative joint density distributions discussed in Sec. III G; all densities in log scale.

The solid line in Fig. 6 shows the average in-degree $\langle k\rangle(A)$ for sites having an Alexa rank $A$. We observe a very marked power-law decay

W ~ ^ ( A ) ~ ^8 • ( 9 ) 

The exponents are not the inverse of each other, since $\langle A\rangle$ and $\langle k\rangle$ are distinct averages. There is a saturation at $k \sim 10^6$ for the scaling regime, presumably due to our constraint $k > 20$ for the Alexa index. We find this scaling particularly interesting, since the Alexa rank is not derived directly from the topology of the network but rather from the traffic generated by users. For site administrators the relatively weak decay (9) implies that the traffic generated by incoming hyperlinks can be a relevant contribution to the overall traffic volume.
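
That the two exponents need not be inverses of each other can be illustrated with synthetic scattered data. In the sketch below the exponent $-0.5$, the scatter and the sample size are purely hypothetical choices (they are not the measured values), and least-squares slopes in log-log space stand in for the two conditional averages. For any scattered data the product of the two slopes equals the squared correlation coefficient of the logarithms and is therefore smaller than one.

```python
import math
import random

def loglog_slope(xs, ys):
    """Least-squares slope of log10(y) versus log10(x)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mean_x) ** 2 for a in lx)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    return sxy / sxx

rng = random.Random(1)                      # fixed seed for reproducibility
# Hypothetical ranks between 1 and 10^6 and degrees k ~ A^(-0.5) with scatter.
ranks = [10 ** rng.uniform(0, 6) for _ in range(5000)]
degrees = [1e6 * a ** -0.5 * 10 ** rng.gauss(0, 0.5) for a in ranks]

slope_k_of_a = loglog_slope(ranks, degrees)   # trend of k as a function of A
slope_a_of_k = loglog_slope(degrees, ranks)   # trend of A as a function of k
print(slope_k_of_a, slope_a_of_k, 1 / slope_k_of_a)
```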



G. Normalization Studies

We have performed a visualization study of the data discussed hitherto by considering relative joint distribution functions, which are obtained by dividing a given joint probability distribution by the product of the respective marginal distributions,

$$ \frac{p(x,y)}{p(x)\,p(y)} , \qquad p(x) = \int p(x,y)\, dy , \qquad p(y) = \int p(x,y)\, dx . $$

In the absence of correlations, viz when $p(x,y) \to p(x)\,p(y)$, the respective density plots would be homogeneous and flat.

In Fig. 7 we show the relative density $p(k,C)/(p(k)\,p(C))$ for pairs of in-degree $k$ and local clustering coefficient $C$. The distribution is quite homogeneous, and the upper and lower limits of the distribution are accentuated in comparison with the plot of the bare joint distribution presented in Fig. 4.

In Fig. 7 we present the relative joint density for the correlation between in- and out-degree. The distribution is considerably more homogeneous than the respective bare probability density shown in Fig. 3. However, a substantial enhancement remains for small in- and out-degrees.

From the shape of the joint nearest-neighbor degree distribution presented in Fig. 5 it would be tempting to think that its shape is mostly determined by the marginal distributions, i.e. that $p(k, k_{\rm nn}) \approx p(k)\, p(k_{\rm nn})$, and that therefore $k$ and $k_{\rm nn}$ would be essentially decorrelated. This is, however, not the case, as we can see in Fig. 7. In this plot the correlations are seen more clearly in terms of the relative joint distribution. We can observe that the resulting relative distribution still shows a stronger correlation when both the in-degree and the in-degree of the neighbor are small.

In Fig. 7 we present the relative density distribution for sites having an Alexa rank $A$ and an in-degree $k$. As for the case of the clustering coefficient $C$, the distribution maintains the very marked upper and lower limits, being otherwise essentially only slightly more uniform than the original data shown in Fig. 6.


IV. LINKS FROM $10^3$ DISTINCT DOMAINS

The present study shows that many properties of the WWW are characterized by non-trivial correlations. We observe that the joint probability distributions, for several of the properties tested, follow power-law scaling for the respective averages. Additionally, the density distributions are, in many instances, limited by power laws. The power-law limiting functions share exponents neither with the marginal distributions nor with the respective average value of the property studied. Interestingly, one also observes power-law scaling for a seemingly unrelated quantity, the Alexa traffic rank, which decays as a function of in-degree, and the average in-degree also decays moderately weakly as a function of the Alexa rank.

We found that the statistical properties of the World Wide Web differ remarkably for domains receiving more or fewer than about $10^3$ hyperlinks from different domains. The change in behavior is observed for the correlations between in- and out-degree, between in-degree and local clustering coefficient, and between in-degree and the in-degree of neighbors. This observation points towards an underlying hierarchical structure of the WWW, with the "elite" of the Internet domains, receiving links from more than one thousand different domains, being made up of about $20 \times 10^3$ sites.